Introduction

Our world is rapidly urbanizing in practically all geographies. In 2018, the UN reported that 55% of the global population lived in urban areas and projected that number to rise to 68% by 2050. With this urbanization, access to green space and parkland is certain to decrease across geographies and across populations. With this general decline in access to nature, there is likely to be an increase in the equity of access, with the wealthy and privileged losing access at a lesser rate than the poor and underprivileged. There has been work done to assess the public health impacts of this inequity in the access to nature in the context of COIVD (Spotswood et al., 2021), as well as work done on the “luxury effect,” which shows that wealthy neighborhoods can support higher biodiversity in some cases due largely to the increased presence of park land (Magel et al., 2021). In this analysis, I seek to understanding the relationship between access to green space and exposure to potentially harmful pollutants. I will also examine the relationship between green space and income. The analysis will specifically be done in the Bay Area of California, a collection of 9 counties surrounding San Francisco Bay.

Intuitively, it makes sense to hypothesize that, in an urban setting like the Bay Area, there would be a relationship between exposure to pollution and contamination and the relative urbanization of a location. We see natural landscapes as clean and urban hardscapes as dirty, and can make assumptions as to which location is healthier to live next to. Below I quantify that relationship using the spatial distribution of parks in San Francisco, summary data from the CalEnviroScreen model, and census block, block group, and tract geometries.

Methods

In developing this analysis, 4 key datasets are leveraged: 1) US Census American Communities Survey (ACS) Income Dataset, 2) US Census Tiger Geometries, 3) CalEnviroScreen 4.0, 4) Public Lands Trust Parks Dataset.

Geospatial Analysis

In this analysis, geospatial work is performed on multiple scales, and scaled up or down according to the needs of a particular section of the analysis. In each of the below summaries of the individual datasets, I have described scaling needs and procedures as applicable. For spatial relationships, all distances are presented in linear feet.

ACS Income

Table B19001 from the US Census Bureau is used to determine income levels in each census BlockGroup. Income levels in this table are presented as discrete counts of the population within certain income bands. To account for this unique arrangement of data, we define our variable of “income” for later regressions as the percent of respondents to the survey that make more than $100k annually (USD). The threshold of $100k is chosen a) as a clean break that is well-defined by the natural arrangement of the income categories and b) because the mean annual income of the Bay Area is roughly $100k.

Census Geometries

Originally, this analysis was intended to be performed with building footprints provided by the city of San Francisco. However, for practical purposes, census geometries were chosen instead. First, the building footprints dataset is extremely large and detailed, and very long processing times presented a limitation early in the development of this analysis, prompting the switch to a dataset with lower resolution and greater spatial extent. Second, the building footprints included residences, commercial structures, government structures, and retail structures. Given the urban setting, many (possibly most) of the structures that were residential were multi-family, which would bias the analysis against people living in apartments, duplexs, etc. Third, because CalEnviroScreen and the ACS income data are presented at the Census Tract and BlockGroup levels, respectively, the building footprints would need to be rolled up to lesser spatial extents anyway. Fourth, because of the roll-up requirement and the limited extent of the building footprints being used (San Francisco), when rolling up to Census geometries, sample size would be diminished for later regressions.

Therefore, the decision was made to focus on Census BlockGroups (for the ACS Income analysis) and Census Tracts (for the CalEnviroScreen analysis) and to expand the analysis to the entire Bay Area of California. This allows for a more generalizable analysis and a more meaningful sample set. When developed, the Blocks dataset was filtered to eliminate all Blocks with a land area value of 0.0 to eliminate the blocks in San Francisco that are all water area. Following this filter, the centroid was calculated for each Block, and this centroid was used to calculate the distance between a Block and the nearest park boundary. For the BlockGroups, the distances were averaged across the Blocks contained, and this average was assigned as the distance for that BlockGroup. For Tracts, the distances were again averaged across the BlockGroups.

CalEnviroScreen

CalEnviroScreen 4.0 is used as a simplified index of general environmental contaminant exposure. While the index incorporates a variety of stressors and pollutants, the summary index was chosen as the most appropriate indicator, given the wide variety of communities and potential exposures throughout the sample area. The CalEnviroScreen 4.0 data is presented at the Census BlockGroup level, and therefore the summary value is compared against the mean distance to a park from the Block centroids contained within each BlockGroup.

The CalEnviroScreen 4.0 summary score is a modeled summary of 21 indicators that are designed to account for 4 risk areas. In the risk area of Exposures, the score summarizes exposure of the population to Ozone, PM2.5, Diesel PM, Drinking Water Contaminants, Toxic Facility Releases, Traffic, Housing Lead Risk to Children, and Pesticides. The Environmental Effects risk area presents the exposure to Cleanup Sites, Solid Waste Facilities, Groundwater Contamination, Impaired Water Bodies, and Hazardous Waste. The sensitive Populations risk area assesses the prevalence of Asthma, Cardiovascular Disease, and Low Birth Weight Infants in a population. Finally, the risk area for Socioeconomic Factors incorporates rates of Education Attainment, Housing Burden, Linguistic Isolation, Poverty, and Unemployment. Together, these 21 indicators are summarized in a model according to this report, and a summary score is generated. This summary score, while limited in its utility as all indexes are, presents a useful metric for understanding general risk across various and diverse geographies, like those examined in this analysis. Moreover, the CalEnviroScreen summary score has been codified into California State policy, such as SB 535, which uses the summary score to designate disadvantaged communities. Precedent exists, therefore, for using this summary score across large geographies, and it is useful to understand these relationships using a metric that is formally adopted as a decision-making tool by the California government.

Trust for Public Lands ParkServe Dataset

Park boundaries were downloaded and incorporated as shapefiles provided by the Trust for Public Lands ParkServe Program. No modifications were done to the park boundaries dataset (with the exception of re-projection), and all data was retained for the nine counties encompassing the Bay Area. Importantly, data were not included for the counties bordering the Bay Area, and therefore it is possible that Blocks on the outer boundary of some Bay Area counties were incorrectly attributed with a “nearest park” that was erroneously far from them if the actual nearest park is in a county that was not included. This is assumed to be a negligible portion of the analysis, given the sample size.

General Workflow

  1. Load park geography/geometry shapefiles and clip to Bay Area Counties
  2. Load Census Blocks and BlockGroups for Bay Area/SF
  3. Convert Blocks to Centroids
  4. Calculate linear distance from each Block centroid to the nearest park boundary and attribute Blocks as such
  5. Within each BlockGroup, take the average linear distance for the Blocks within the tracts
  6. Assign the average distance to the park boundary as an independent variable for later regression analyses
  7. Calculate the percentage of the population with an income over $100k for BlockGroups, and attribute Tracts with CalEnviroScreen summary data
  8. Regression - Regress average distance to parks against Income and CalEnviroScreen

Fig. 1

A map of all Park Lands in the Bay Area, California, according to the Trust For Public Lands ParkServe Data Program

Results

Presented below are the result of our Regression Analyses below in Figures 2-10 and in the accompanying model results summaries.

Income versus Distance to Parks

In Fig. 2, we can see the scatterplot of the percentage of a census BlockGroup that has an annual income greater than $100k USD plotted against the average linear distance (feet) to the nearest park from the centroid of census blocks within a BlockGroup. Below the scatterplot we can see the model output summary, which shows an R-squared of near zero (0.0001) and a p-value of 0.53, indicating that this predictive relationship is not significant (p > 0.05). After analysis of the residuals plotted in Fig. 3, there is a noticeable skew, with the mean and median appearing to be nearer to 20 than 0, which indicates that this simple linear fit may not be the most appropriate.

Therefore, a log-transformation was performed, where the natural log of income (Perc100k) is plotted against distance in Fig. 4. A summary of this log-transformed model output is also included below the figure, which shows that this relationship is also not statistically significantly predictive, with a p-value 0.422. Fig. 5 shows that the residuals of this log-transformation are centered more closely around 0.0, however they now show even more significant skew to the left.

Based on both of these models, we cannot conclude that there is a statistically significant relationship between the percentage of annual income in a census BlockGroup that exceeds $100k USD and the distance of that BlockGroup to the nearest park boundary.

Fig. 2

Scatterplot and Line of Best Fit for the linear regression between the percentage of a census Block with an income over $100k annually and the linear distance (feet) between the centroid of that Block and the nearest Park boundary. Below is a summary of the model outputs:

## 
## Call:
## lm(formula = Perc100k ~ Length, data = Blocks_100k_log_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.477 -15.182   1.897  15.808  48.132 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.5838107  0.7426227  70.808   <2e-16 ***
## Length      -0.0004045  0.0006441  -0.628     0.53    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.24 on 3918 degrees of freedom
## Multiple R-squared:  0.0001006,  Adjusted R-squared:  -0.0001546 
## F-statistic: 0.3944 on 1 and 3918 DF,  p-value: 0.5301

Fig. 3

Density plot of the residuals for the linear model. Residuals are skewed right, indicating that a linear model for this relationship is likely not the best fit.

Fig. 4

Log-tranformed scatterplot and Line of Best Fit for the linear regression between the natural log of the percentage of a census Block with an income over $100k annually and the linear distance (feet) between the centroid of that Block and the nearest Park boundary. Below is a summary of the model outputs:

## 
## Call:
## lm(formula = log(Perc100k) ~ Length, data = Blocks_100k_log_analysis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2817 -0.2370  0.1434  0.3728  0.7688 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.860e+00  1.936e-02 199.353   <2e-16 ***
## Length      -1.349e-05  1.680e-05  -0.803    0.422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5277 on 3918 degrees of freedom
## Multiple R-squared:  0.0001647,  Adjusted R-squared:  -9.053e-05 
## F-statistic: 0.6453 on 1 and 3918 DF,  p-value: 0.4219

Fig. 5

Density plot of the residuals for the log-tranformed linear model. Residuals are roughly normally distributed, demonstrating that a linear regression is likely an appropriate model to understand the relationship

CalEnviroScreen versus Distance to Parks

In Fig. 6, we can see the scatterplot of the CalEnviroScreen 4.0 summary score, as described above, plotted against average linear distance (feet) to the nearest park from the centroid of census blocks within a census Tract. Below the scatterplot we can see the model output summary, which shows an r-squared near zero (0.002). Despite this low r-squared, however, we do see a p-value that is marginally statistically significant (p = 0.0645), though not less than 0.05. With this marginal significance and ~93% confidence level, it can be concluded that, for each 1000 linear feet nearer to a park a census tract is, we can expect to see a decrease in 1.43 points in the CalEnviroScreen 4.0 summary score.

Upon analysis of Fig. 7, however, we can see that the residuals are skewed to the right and not centered about 0.0, so a log-transformation was performed to address the potentially poor fit of a simple linear regression for this model. Fig. 8 shows the scatterplot of this log-transformation, and the summary of the model output is included below it. While we still show a near zero r-squared value (0.0027), we again return a statistically significant p-value for the predictive relationship between the summary score and distance to the nearest park. With the log-transformation, our p-value becomes 0.0415, within the generally accepted range of statistical significance (p < 0.05). Therefore, at a >95% confidence, we can say that for each 1000 linear feet nearer to a park a census tract is, we can expect to see a decrease in 0.09677 in the natural log of the CalEnviroScreen 4.0 summary score.

With this marginal decrease in p-value and almost no change in the r-squared value, this log-transformation of our relationship is a potential example of p-hacking, manipulating data to find a pre-determined desired level of significance.

Fig. 6

Scatterplot and Line of Best Fit for the linear regression between the summary score of the CalEnviroScreen 4.0 and the linear distance between the mean distance (feet) between Blocks contained within the BlockGroup and the nearest Park boundary. Below is a summary of the model outputs:

## 
## Call:
## lm(formula = Score ~ Length, data = CES4_Bay_Tracts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.050  -9.185  -2.688   7.295  46.752 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.0993755  0.8443874   23.80   <2e-16 ***
## Length      -0.0014341  0.0007751   -1.85   0.0645 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.03 on 1556 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.002195,   Adjusted R-squared:  0.001554 
## F-statistic: 3.423 on 1 and 1556 DF,  p-value: 0.06449

Fig. 7

Density plot of the residuals for the linear model. Residuals are skewed right, indicating that a linear model for this relationship is likely not the best fit.

Fig. 8

Scatterplot and Line of Best Fit for the linear regression between the summary score of the CalEnviroScreen 4.0 and the linear distance between the mean distance (feet) between Blocks contained within the BlockGroup and the nearest Park boundary. Below is a summary of the model outputs:

## 
## Call:
## lm(formula = log(Score) ~ Length, data = CES4_Bay_Tracts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50878 -0.44396  0.07815  0.56054  1.53289 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.788e+00  5.168e-02   53.95   <2e-16 ***
## Length      -9.677e-05  4.744e-05   -2.04   0.0415 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7362 on 1556 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.002667,   Adjusted R-squared:  0.002026 
## F-statistic: 4.161 on 1 and 1556 DF,  p-value: 0.04154

Fig. 9

Density plot of the residuals for the log-transformed linear model. Residuals are roughly normally distributed, demonstrating that a log-tranformed linear regression is likely an appropriate model to understand the relationship

Multiple Regression with Log Transformation

In past analyses, we have established that there is a significant relationship between various sub-metrics of the CalEnviroScreen dataset and income in the Bay Area. Therefore, it is useful to attempt a multiple regression in order to examine the relationship between CalEnviroScreen summary scores and distance to parks, controlled for income. Fig. 10 below shows the residuals of this plot, and above Fig. 10 we can see the summary output of this log-transformed multiple regression. Unsurprisingly, we see an extremely significant statistical relationship between CalEnviroScreen summary score and the percentage of a census tract that is earning more than $100k USD annually (p < 2e-16). Interestingly, we also see a marginally significant predictive relationship between the summary score and distance to parks, with the p-value identical (0.0645) to the simple linear regression p-value. We can conclude that, with a ~93% confidence interval, for each 1000 linear feet nearer to a park a census tract is, we can expect to see a decrease in 0.0612 in the natural log of the CalEnviroScreen 4.0 summary score. The results of this multiple regression are visualized in a 3-dimensional scatterplot in Fig. 11

Below is a summary of a log-transformed multiple regression model [Score ~ Length + Perc100k], which plots the relationship between distance to the nearest park and CalEnviroScreen Summary score, controlling for income in the census BlockGroup (as percentage making $100k USD per year or more). The summary below shows that there is a significant predictive relationship between CalEnviroScreen summary scores and income (p < 2e-16), however the relationship between distance to the nearest park and CalEnviroScreen summary score still was not significant (p = 0.138). That said, this p-value is much closer to significance than that of any of the above relationships, so the model does have a predictive capacity closer to significance.

## 
## Call:
## lm(formula = log(Score) ~ Length + Perc100k, data = SF_100k_Score_Sum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58260 -0.29339  0.02902  0.33606  1.53315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.269e+00  5.284e-02   80.79   <2e-16 ***
## Length      -6.017e-05  3.252e-05   -1.85   0.0645 .  
## Perc100k    -2.927e-02  7.569e-04  -38.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5056 on 1380 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.5215, Adjusted R-squared:  0.5208 
## F-statistic:   752 on 2 and 1380 DF,  p-value: < 2.2e-16

Fig. 10

Density plot of the residuals for the log-transformed linear multiple regression model. Residuals are roughly normally distributed, demonstrating that a log-tranformed linear regression is likely an appropriate model to understand the relationship

Fig. 11

3-Dimensional plot of the multiple regression, with linear distance (feet) to the nearest park on the x-axis, CalEnviroScreen Summary score on the y-axis, and income (Perc100k) on the z-axis

Conclusion

In conclusion, when assessing significance based on a p-value threshold of 0.05, the only statistically significant relationship that was shown in this analysis was that between the natural log of CalEnviroScreen scores and linear distance to the nearest park. Even that relationship could be argued as an instance of p-hacking, with the non-log-transformed regression showing a marginally significant result. Several explanations exist that may account for this surprising result. In examining Fig. 1, we can see that parks in the Bay Area are distributed relatively uniformly throughout the region, and it is therefore likely that the distribution of distances to the nearest park would also be relatively uniform. In fact we do see this in our scatterplots, where the distribution of points is spread relatively evenly across the plots. What is likely more important, and presents potential future research, is the introduction of quality or quantity measures into the analysis. While access to some park may be relatively uniform, access to well maintained, safe, and enjoyable parks could be a different story. While outside the scope of this analysis, sub-setting parks by quality could be a valuable strategy to more accurately examine whether a community has genuine access to nature.

Within the scope of this analysis, in order to examine whether different subsets of parks have a greater or lesser effect in different geographies across the Bay Area, I have developed a dashboard which allows a user to select one or more park owners (Federal Government, NGO, county, etc.) and a county within the Bay Area, and perform a regression to predict the relationship between income (Perc100k) and distance to the nearest park of the selected category. The dashboard also displays a map of the selected parks in the county of interest to demonstrate the spatial scale and density of the regression analysis. Of note, the dashboard operation is extremely slow due to the processing flow built into it. In future versions of this analysis, the workflow would optimized so that less processing is done in the Shiny interface. The dashboard can be accessed at the link below:

DASHBOARD LINK

Direct Link: https://roconnor.shinyapps.io/FInalProject_Dashboard/